My course diary (this page) is located at https://hennake.github.io/IODS-project/.
My GitHub repository for this course is located at https://github.com/hennake/IODS-project.
Because Data Science is cool, and Open Data Science is even more so!
I’m quite familiar with R (5 years of experience), but I’d also like to learn how to use GitHub and RMarkdown.
How to get started with Git, GitHub and RMarkdown.
Working process was approved by the cat.
The dataset consists of data on the relationship between students’ learning approaches and their achievements in an introductory statistics course in Finland. More information about the original dataset is available at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt.
After manipulation, the dataset only has the following variables:
* gender Gender: M (Male), F (Female)
* age Age (in years) derived from the date of birth
* attitude Global attitude toward statistics
* deep Average of points from questions related to deep learning approach
* stra Average of points from questions related to strategic learning approach
* surf Average of points from questions related to surface learning approach
* points Exam points
Students with zero exam points were removed from the dataset.
Data manipulation is described in detail here.
The data can be read into RStudio e.g. with the read.table command as follows:
students2014 <- read.table("./data/learning2014.txt", sep="\t", dec=".", header=TRUE)
Now the data is stored in the R object students2014.
How does the manipulated data look? A table of summary statistics is often a good starting point for an explorative analysis.
library(psych)
describe(students2014)
## vars n mean sd median trimmed mad min max range skew
## gender* 1 166 1.34 0.47 1.00 1.30 0.00 1.00 2.00 1.00 0.68
## age 2 166 25.51 7.77 22.00 23.99 2.97 17.00 55.00 38.00 1.89
## attitude 3 166 31.43 7.30 32.00 31.52 7.41 14.00 50.00 36.00 -0.08
## deep 4 166 3.68 0.55 3.67 3.70 0.62 1.58 4.92 3.33 -0.50
## stra 5 166 2.79 0.53 2.83 2.78 0.62 1.58 4.33 2.75 0.14
## surf 6 166 3.12 0.77 3.19 3.14 0.83 1.25 5.00 3.75 -0.11
## points 7 166 22.72 5.89 23.00 23.08 5.93 7.00 33.00 26.00 -0.40
## kurtosis se
## gender* -1.54 0.04
## age 3.24 0.60
## attitude -0.48 0.57
## deep 0.66 0.04
## stra -0.27 0.04
## surf -0.45 0.06
## points -0.26 0.46
A boxplot is a graphical tool for examining the distributions of individual variables. In this case, there is quite a lot of variation in the ranges of individual variables, and the picture is not as informative as one could wish.
boxplot(students2014)
We can draw a pairplot to check if there are associations between the variables.
library(GGally)
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
In order to check the gender distribution of students, we can use the table command. It seems that women are more interested in statistics than men! (Or perhaps there were simply more women for whom the course was mandatory…)
table(students2014$gender)
##
## F M
## 110 56
The pairplot suggests that the variable points might be slightly bimodal, but in a histogram there seems to be no severe deviation from normality.
hist(students2014$points, main="Distribution of exam points", xlab="points")
All in all, the explorative analysis suggests that a student’s attitude toward statistics may be associated with his/her exam points.
Let’s fit a regression model to check whether the suggested association between a student’s attitude and his/her exam points is statistically sound. Clearly, exam points is the response variable and attitude is an explanatory variable. Adding multiple explanatory variables to the model might reveal more associations between the variables, so let’s also include gender and surf as explanatory variables.
fit1 <- lm(points ~ gender + attitude + surf, data = students2014)
summary(fit1)
##
## Call:
## lm(formula = points ~ gender + attitude + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7179 -3.3285 0.5343 3.7412 10.9007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97982 2.40303 3.737 0.000258 ***
## genderM -0.22362 0.92477 -0.242 0.809231
## attitude 0.35100 0.05956 5.893 2.13e-08 ***
## surf 0.89107 0.54409 1.638 0.103419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared: 0.2051, Adjusted R-squared: 0.1904
## F-statistic: 13.93 on 3 and 162 DF, p-value: 3.982e-08
Attitude is the only statistically significant (p < 0.05) explanatory variable for exam points among the three that were tested. The other two are not statistically significant (p > 0.05), i.e. we don’t have sufficient evidence to reject the null hypothesis that the regression coefficients of gender and surf (column Estimate) are equal to zero. We should note that the cutoff p-value of 0.05 is only a coarse rule of thumb, and cases near the borderline should be examined more carefully.
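These significance judgments can also be made programmatically by extracting the coefficient table from the summary object. A minimal sketch on synthetic data (the variables x1 and x2 are made up for illustration; x2 deliberately has no real effect):

```r
# Sketch: extract estimates and p-values from a summary.lm object
# (synthetic data; x2 has no real effect on y).
set.seed(7)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 + rnorm(50)
ct <- summary(lm(y ~ x1 + x2, data = d))$coefficients
ct[, "Pr(>|t|)"] < 0.05   # which terms are significant at the 0.05 level?
```

The same extraction works on fit1 above, e.g. `summary(fit1)$coefficients[, "Pr(>|t|)"]`.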
Let’s refit the model without gender and surf:
fit2 <- lm(points ~ attitude, data = students2014)
summary(fit2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The explanatory variable attitude is statistically significant (p < 0.001), so we can reject the null hypothesis that attitude has no effect on exam points. The intercept (constant) term of the model, which is the expected mean value of exam points when attitude is zero, is also statistically significant. There is only one explanatory variable left in this model, so we don’t need to worry about potential collinearity between explanatory variables. The coefficient of attitude (0.35) means that when attitude increases by 1 unit, the student’s exam score is expected to increase by 0.35 units on average. The multiple R-squared value indicates that the model explains about 19 % of the variation in the response variable. The adjusted R-squared value is essentially the same thing, but it accounts for the fact that R-squared automatically increases when extra explanatory variables are added to the model.
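As a quick sanity check of the coefficient interpretation, here is a minimal sketch on synthetic data (the values 11.6 and 0.35 are only chosen to resemble the fitted model above): increasing the predictor by one unit shifts the prediction by exactly the estimated slope.

```r
# Sketch: a 1-unit increase in the predictor changes the lm prediction
# by exactly the estimated coefficient (synthetic data).
set.seed(1)
x <- runif(100, 14, 50)                    # attitude-like scores
y <- 11.6 + 0.35 * x + rnorm(100, sd = 5)  # points-like response
m <- lm(y ~ x)
b <- unname(coef(m)["x"])
p <- predict(m, newdata = data.frame(x = c(30, 31)))
unname(diff(p))                            # equals the slope estimate b
```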
We can now visualize the regression model on the observed data as a red line:
plot(students2014$attitude, students2014$points, xlim = c(0, max(students2014$attitude)), xlab="Attitude toward statistics", ylab="Exam points")
abline(fit2, col="red")
Note to self: Using the p-values of individual model coefficients is not the recommended way of doing model selection; it might be a better idea to calculate e.g. an AIC value for each model. However, the results are the same in this case: a smaller AIC means a better model fit.
paste0("AIC for model 1: ", AIC(fit1))
## [1] "AIC for model 1: 1030.9779677774"
paste0("AIC for model 2: ", AIC(fit2))
## [1] "AIC for model 2: 1029.9875233202"
Before declaring that a student’s attitude toward statistics is a good predictor of his/her exam performance, we have to make sure that the critical assumptions of the linear regression model are met. The assumptions of the model are:

1. The errors are normally distributed.
2. The errors are not correlated with each other.
3. The errors have a constant variance (homoscedasticity).
4. The relationship between the response and the explanatory variables is linear.
Let’s produce a couple of diagnostic plots to check these assumptions.
par(mfrow=c(2,2))
plot(fit2, which=c(1,2,5))
The first plot (residuals vs. fitted values) shows no detectable pattern in the residuals, indicating that assumptions 3 and 4 are valid. Assumption 1 also seems to be valid, as the residuals align quite neatly along or near the straight line in the QQ plot. No influential outliers show up in the last plot. Assumption 2 cannot be assessed based on these diagnostic plots, but it is mostly a concern when there is a possibility of temporal or spatial autocorrelation in the data (e.g. time series or spatial data). Thus, the assumptions of the linear regression model seem to hold for our model.
There is a statistical association between a student’s attitude toward statistics and his/her exam performance: the more positive the attitude, the better the expected performance. This association can be modeled with a linear regression model.
The joined data set used in this analysis exercise combines two student alcohol consumption data sets. The following adjustments have been made in the data wrangling exercise:
More info: Original data source and R-code for data manipulation
First, let’s read the data into R and list the variable names:
alc <- read.table("./data/alc.csv", sep=";", dec=".", header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The purpose of the analysis was to study the relationships between high/low alcohol consumption and some of the other variables in the data. I chose the following variables for the analysis and set these hypotheses:

Response variable:

* high_use: whether the student’s alcohol consumption is high

Explanatory variables:

* sex: males are more prone to heavy drinking than females
* Pstatus: parents’ cohabitation status is associated with the student’s alcohol consumption
* absences: a higher number of absences is associated with higher alcohol consumption
* G3: a better final grade is associated with lower alcohol consumption
Next, I limited the data used for the analysis to include only the five variables mentioned above.
library(dplyr)
dd <- select(alc, one_of(c("sex", "Pstatus", "absences", "G3", "high_use")))
str(dd)
## 'data.frame': 382 obs. of 5 variables:
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ absences: int 5 3 8 1 2 8 0 4 0 0 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ high_use: logi FALSE FALSE TRUE FALSE FALSE FALSE ...
Now we have some hypotheses produced with the Stetson-Harrison method. Will the data shoot them down straight away? Let’s produce some summaries and explorative plots of the variables and their associations to find out.
First, a basic summary table:
# Basic summary table
library(psych)
describe(dd, skew=F)
## vars n mean sd min max range se
## sex* 1 382 1.48 0.50 1 2 1 0.03
## Pstatus* 2 382 1.90 0.30 1 2 1 0.02
## absences 3 382 4.50 5.46 0 45 45 0.28
## G3 4 382 11.46 3.31 0 18 18 0.17
## high_use* 5 382 NaN NA Inf -Inf -Inf NA
Sex and parents’ cohabitation status are binary factors, and their associations with high alcohol use are easy to present as crosstabulations. Absences and final grade are integer variables with a wide range of values, and thus not so suitable for crosstabulation.
tab1 <- table(high_use=dd$high_use, sex=dd$sex)
tab2 <- table(high_use=dd$high_use, parents_status=dd$Pstatus)
tabp1 <- prop.table(tab1, 2)
tabp2 <- prop.table(tab2, 2)
list(addmargins(tab1), tabp1,
addmargins(tab2), tabp2)
## [[1]]
## sex
## high_use F M Sum
## FALSE 156 112 268
## TRUE 42 72 114
## Sum 198 184 382
##
## [[2]]
## sex
## high_use F M
## FALSE 0.7878788 0.6086957
## TRUE 0.2121212 0.3913043
##
## [[3]]
## parents_status
## high_use A T Sum
## FALSE 26 242 268
## TRUE 12 102 114
## Sum 38 344 382
##
## [[4]]
## parents_status
## high_use A T
## FALSE 0.6842105 0.7034884
## TRUE 0.3157895 0.2965116
We might visualize these associations with stacked proportional bar plots:
layout(matrix(c(1,1,2,2,3), nrow=1, ncol=5, byrow=T))
par(mai=c(0.6,0.3,0.6,0.3))
barplot(tabp1, main="Sex vs high use", col=c("#F8766D","#00BFC4"))
barplot(tabp2, main="Pstatus vs high use", col=c("#F8766D","#00BFC4"))
par(mai=c(0.7,0.1,0.7,0.1))
plot.new()
legend("topleft", legend=c("FALSE","TRUE"), fill=c("#F8766D","#00BFC4"), title="High use")
The associations of absences and final grade with alcohol use can be visualized more easily with boxplots:
library(ggplot2)
library(gridExtra)
g1 <- ggplot(dd, aes(x = high_use, y = G3, fill=high_use)) + ggtitle("Grade vs high use")
g2 <- ggplot(alc, aes(x = high_use, y = absences, fill=high_use)) + ggtitle("Absences vs high use")
bp3 <- g1 + geom_boxplot() + ylab("grade") + theme(legend.position="none", plot.title = element_text(hjust = 0.5))
bp4 <- g2 + geom_boxplot() + ylab("absences") + theme(legend.position="none", plot.title = element_text(hjust = 0.5))
grid.arrange(bp3, bp4, ncol=2, nrow=1)
Based on the explorative analysis, males seem to be more prone to heavy drinking than females. Heavy drinkers get lower grades on average than those who don’t drink much. Heavy drinkers are also slightly more inclined to have a high number of absences than others. However, there seems to be no association between parents’ cohabitation status and the student’s alcohol use. It seems that my hypotheses about the associations of sex, absences and final grade with alcohol consumption might get some support from the explorative analysis.
Let’s fit a logistic regression model to see if these findings hold.
# First model
fit1 <- glm(high_use ~ sex + Pstatus + absences + G3, family=binomial, data=dd)
summary(fit1)
##
## Call:
## glm(formula = high_use ~ sex + Pstatus + absences + G3, family = binomial,
## data = dd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2443 -0.8411 -0.6244 1.0629 2.1686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.98141 0.61643 -1.592 0.1114
## sexM 0.98330 0.24149 4.072 4.67e-05 ***
## PstatusT 0.01709 0.40418 0.042 0.9663
## absences 0.09273 0.02316 4.003 6.24e-05 ***
## G3 -0.07570 0.03584 -2.112 0.0347 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 425.51 on 377 degrees of freedom
## AIC: 435.51
##
## Number of Fisher Scoring iterations: 4
The equation of the model is of the form log(p/(1-p)) = a + b*sex + c*Pstatus + d*absences + e*G3, where p is the estimated probability of high alcohol use and a-e are the model coefficients. log(p/(1-p)) is the logit function of p. For factor variables (sex and Pstatus), one level is chosen as the base level and a separate coefficient is estimated for each of the other levels. The coefficients presented above denote the difference between the marked level and the base level. So for the base level F (omitted from the table) the intercept -0.98 applies as such, while for the level M the effective intercept is -0.98141 + 0.98330 = 0.00189 (cf. explanation here).
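To make the logit concrete, the linear predictor can be turned back into a probability with the inverse logit function. A sketch using the coefficient estimates printed above (the example student, a male with 10 absences and a final grade of 10, is hypothetical):

```r
# Sketch: from log-odds to probability with the inverse logit.
# Coefficient values copied from the model summary above.
a <- -0.98141; b_sexM <- 0.98330; d_abs <- 0.09273; e_g3 <- -0.07570
# Linear predictor (log-odds) for a male student with 10 absences, grade 10:
lp <- a + b_sexM + d_abs * 10 + e_g3 * 10
p  <- plogis(lp)   # same as exp(lp) / (1 + exp(lp))
p                  # estimated probability of high alcohol use
```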
The coefficients are easier to interpret if transformed into odds ratios (ORs) by exponentiation: OR = exp(b), where b is the original coefficient. The odds ratio of a coefficient denotes the multiplicative change in the odds p/(1-p) when the value of the explanatory variable increases by one unit. We can also use exponentiation and the confint function to calculate confidence intervals for the odds ratios at the 0.95 confidence level.
data.frame(OR=round(exp(summary(fit1)$coef[,1]), 3),
CI=round(exp(confint(fit1)), 3),
"p-value"=round(summary(fit1)$coef[,4], 3))
## OR CI.2.5.. CI.97.5.. p.value
## (Intercept) 0.375 0.109 1.228 0.111
## sexM 2.673 1.675 4.325 0.000
## PstatusT 1.017 0.472 2.331 0.966
## absences 1.097 1.051 1.151 0.000
## G3 0.927 0.864 0.994 0.035
Interpretation of the OR values:

* sexM (OR 2.67): the odds of high alcohol use are about 2.7 times higher for males than for females
* PstatusT (OR 1.02): the confidence interval contains 1, so there is no evidence of an association with parents’ cohabitation status
* absences (OR 1.10): each additional absence multiplies the odds of high use by about 1.10
* G3 (OR 0.93): each additional grade point multiplies the odds of high use by about 0.93, i.e. decreases them by about 7 %

We can check with a likelihood ratio test whether the factor variables sex and Pstatus as a whole are statistically significant explanatory variables in our fitted model.
anova(fit1, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: high_use
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 465.68
## sex 1 14.7320 380 450.95 0.0001239 ***
## Pstatus 1 0.1157 379 450.83 0.7337978
## absences 1 20.8326 378 430.00 5.012e-06 ***
## G3 1 4.4902 377 425.51 0.0340903 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems that sex, absences and final grade (G3) are statistically significant explanatory variables for alcohol use. The directions of their effects are also as I hypothesized: the probability of high alcohol use increases with a higher number of absences, decreases with better grades, and is higher for males than for females. My hypothesis about parents’ cohabitation was however wrong, as cohabitation is not a statistically significant explanatory variable in the model.
The variable Pstatus can be omitted from the model, as it is not statistically significant. Let’s refit the model:
fit2 <- glm(high_use ~ sex + absences + G3, family=binomial, data=dd)
data.frame(OR=round(exp(summary(fit2)$coef[,1]), 3),
CI=round(exp(confint(fit2)), 3),
"p-value"=round(summary(fit2)$coef[,4], 3))
## OR CI.2.5.. CI.97.5.. p.value
## (Intercept) 0.381 0.153 0.925 0.035
## sexM 2.674 1.676 4.326 0.000
## absences 1.097 1.051 1.150 0.000
## G3 0.927 0.864 0.994 0.033
anova(fit2, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: high_use
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 465.68
## sex 1 14.7320 380 450.95 0.0001239 ***
## absences 1 20.8782 379 430.07 4.894e-06 ***
## G3 1 4.5585 378 425.51 0.0327561 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looks good, but how does the model perform as a binary classifier? We can use the model to predict the probability of high alcohol consumption for the individuals in the training data, and then dichotomize these probabilities to produce a binary prediction of the outcome. For this, we need to set a cutoff probability for a positive outcome (= high alcohol consumption). Choosing a sensible cutoff is an art in its own right, but let’s simply use the threshold of 0.5 (i.e. probabilities greater than 0.5 are interpreted to indicate high alcohol consumption). This is probably not a good choice, and it would be a better idea to define the cutoff value with e.g. a ROC curve.
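As a rough illustration of why the cutoff matters, here is a base-R sketch that scans candidate thresholds (synthetic probabilities and outcomes, not our model; a proper analysis would use a ROC curve, e.g. from the pROC package):

```r
# Sketch: how the misclassification rate depends on the chosen cutoff
# (synthetic predicted probabilities and outcomes for illustration only).
set.seed(42)
prob <- runif(200)                  # hypothetical predicted probabilities
obs  <- rbinom(200, 1, prob)        # hypothetical observed outcomes
cuts <- seq(0.1, 0.9, by = 0.1)
err  <- sapply(cuts, function(k) mean((prob > k) != obs))
rbind(cutoff = cuts, error = round(err, 3))
```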
# Predicting the training data
pred <- predict(fit2, type="response")
dd$prob <- pred
pred2 <- ifelse(pred > 0.5, TRUE, FALSE)
dd$pred <- pred2
Then we can produce a confusion matrix, which is a two-way crosstabulation of observed versus predicted values for the outcome variable, and visualize it.
# Confusion matrix
conf <- table(obs=dd$high_use, pred=pred2)
addmargins(conf)
## pred
## obs FALSE TRUE Sum
## FALSE 256 12 268
## TRUE 86 28 114
## Sum 342 40 382
prop.table(conf)
## pred
## obs FALSE TRUE
## FALSE 0.67015707 0.03141361
## TRUE 0.22513089 0.07329843
# Same as a plot
# initialize a plot of 'high_use' versus 'probability' in 'dd'
g <- ggplot(dd, aes(x = prob, y = high_use, col = pred2))
# define the geom as points and draw the plot
g + geom_point()
From the confusion matrix we can compute that the total proportion of inaccurately classified individuals (= the training error of the model) is (86 + 12) / 382 = 0.26. Is this better than the simple guessing strategy that nobody is a high user (as not drinking heavily is more common than drinking heavily)?
The training error of the model:
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = dd$high_use, prob = dd$pred)
## [1] 0.2565445
The training error of the simple guessing:
# Simple guessing strategy: nobody is a high user
dd$pred3 <- F
loss_func(class = dd$high_use, prob = dd$pred3)
## [1] 0.2984293
The training error of the simple guessing strategy is only slightly bigger than the error of our model, so it seems that our model is not a very successful predictor of alcohol consumption.
Above, we assessed the model performance based on the model’s ability to correctly predict the same data on which it was trained. This is not a good idea, as the model generally fits the training data better than any unseen data. To avoid this overestimation, we can perform cross-validation to get a more realistic estimate of the actual predictive power of the model. Let’s perform 10-fold cross-validation, which means that we divide the data into 10 complementary parts. One part forms the test set, and the other 9 parts form the training set. We fit the model on the training set and then validate the results on the smaller test set. We repeat this process so that eventually all of the data is used for both training and testing, and then calculate the mean error rate of our model as the average over all repetitions.
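The procedure can also be written out by hand. The following base-R sketch (synthetic data and made-up variable names) mirrors the 10-fold scheme described above; cv.glm from the boot package does the same for our actual model:

```r
# Sketch: manual 10-fold cross-validation of a logistic model (synthetic data).
set.seed(123)
n  <- 200
df <- data.frame(x = rnorm(n))
df$y <- rbinom(n, 1, plogis(df$x))
folds <- sample(rep(1:10, length.out = n))   # assign each row to one of 10 folds
errs <- sapply(1:10, function(k) {
  fit <- glm(y ~ x, family = binomial, data = df[folds != k, ])  # train on 9 folds
  pr  <- predict(fit, newdata = df[folds == k, ], type = "response")
  mean((pr > 0.5) != df$y[folds == k])       # error on the held-out fold
})
mean(errs)   # cross-validated error estimate
```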
library(boot)
cv <- cv.glm(data = dd, cost = loss_func, glmfit = fit2, K = 10)
cv$delta[1]
## [1] 0.2696335
We can see that the cross-validated error rate of our model is indeed higher than its training error (0.257). The test set performance of the model is about the same as that of the model fitted in the DataCamp exercise.
The dataset used in this exercise contains housing data on suburban Boston. It is a built-in dataset in the R package MASS and can be loaded with the data(Boston) command. Let’s check the variables and the structure of the data:
# Load Boston data
library(dplyr)
library(MASS)
data(Boston)
# Structure and dimensions
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
There are 506 rows and 14 columns in the data. The dataset is described in more detail in its help file.
What does the data look like? Let’s produce summaries and a graphical overview of the variables:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
All variables are numerical/integer except chas, which is a binary dummy variable indicating the location with respect to Charles River. The variables vary in scale.
# Pairplot
pairs(Boston, cex=0.1, pch=16)
Visualization of correlations between variables:
# Correlation matrix and visualization
library(tidyr)
library(corrplot)
cor_matrix<-cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
There seem to be a number of linear associations between the variables, as shown in the pairplot, e.g. between the average number of rooms (rm) and the median value of owner-occupied homes (medv), and between the proportion of owner-occupied units built prior to 1940 (age) and the nitrogen oxides concentration (nox), to name but a few.
Let’s standardize the dataset for further analyses. This means that we subtract the column mean from each column and divide the difference by the column’s standard deviation. After standardization, all the (scaled) variables are centered around zero, and their unit is one standard deviation.
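A one-variable sketch verifying that scale() computes exactly (x - mean) / sd:

```r
# Sketch: scale() is (x - mean(x)) / sd(x), column by column.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto   <- as.numeric(scale(x))
all.equal(manual, auto)   # TRUE
```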
# Scaling
bs <- as.data.frame(scale(Boston))
summary(bs)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Let’s transform the crime rate (crim) into a categorical variable. We want to cut the variable by quantiles to get the high, low and middle rates of crime into their own categories. This is achieved by using the quantiles as break points.
# Transform 'crim' to a categorical variable
bs$crim <- cut(bs$crim, breaks = quantile(bs$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# Look at the table of the new factor crim
table(bs$crim)
##
## low med_low med_high high
## 127 126 126 127
Then we need to divide the dataset into training and test sets for further analyses. Splitting the original data into training and test sets allows us to check how well our models work on unseen data. We make the division here in such a way that 80 % of the data belongs to the training set (and 20 % to the test set).
ind <- sample(nrow(bs), nrow(bs)*0.8)
train <- bs[ind,]
test <- bs[-ind,]
Linear discriminant analysis (LDA) is a classification method that can be used to find linear combinations of the explanatory variables that separate the target classes, and to predict the classes of new observations. The target variable of LDA needs to be categorical.
Let’s try LDA on the newly created categorical crime rate variable (crim) to see if we can find combinations of variables that separate different crime rate classes from each other. This means that we use all the other variables in the dataset as predictor variables. The LDA is fitted on the train dataset.
# Linear discriminant analysis
set.seed(123)
lda.fit <- lda(formula=as.numeric(crim) ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(as.numeric(crim) ~ ., data = train)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2673267 0.2301980 0.2574257 0.2450495
##
## Group means:
## zn indus chas nox rm age
## 1 0.99063148 -0.9050961 -0.16296517 -0.8639671 0.40854511 -0.8952370
## 2 -0.09028089 -0.3357776 -0.01832260 -0.5961645 -0.15291520 -0.4018061
## 3 -0.39571399 0.2384871 0.18195173 0.4317147 -0.01722921 0.4291779
## 4 -0.48724019 1.0171737 -0.07348562 1.0735101 -0.37869675 0.7956921
## dis rad tax ptratio black lstat
## 1 0.8712878 -0.6979446 -0.7316013 -0.41399394 0.3781200 -0.76870358
## 2 0.4208987 -0.5373033 -0.4699552 -0.09071536 0.3113432 -0.10393455
## 3 -0.3980661 -0.3932819 -0.2763409 -0.26815160 0.1205193 0.08029325
## 4 -0.8393230 1.6375616 1.5136504 0.78011702 -0.5481859 0.82818338
## medv
## 1 0.46618631
## 2 -0.02905425
## 3 0.07473939
## 4 -0.69328787
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10227956 0.7233755877 -1.03591471
## indus 0.01414096 -0.3324760590 0.14998040
## chas -0.07288697 -0.0758892758 0.13292261
## nox 0.32581245 -0.5425731577 -1.30124379
## rm -0.09702408 -0.0344156820 -0.12054485
## age 0.30511189 -0.3060396099 -0.20909820
## dis -0.09572531 -0.2357982958 0.32970448
## rad 2.98430397 0.9236590758 -0.08131541
## tax 0.03608205 -0.0325967724 0.74589200
## ptratio 0.07837874 0.0779243451 -0.35891926
## black -0.09826692 -0.0006390227 0.10628839
## lstat 0.20805989 -0.2920412751 0.47698883
## medv 0.14196108 -0.4475367494 -0.13704067
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9394 0.0442 0.0164
We can see from the result that three linear discriminant functions were formed (the number of crime categories minus one, which is the maximum possible number). These discriminant functions are linear combinations of the original variables, and the coefficient of each variable in each function can be seen in the table. From the proportion of trace we can see that the first discriminant function LD1 explains about 94 % of the between-group variance, while LD2 and LD3 explain about 4.4 % and 1.6 %, respectively.
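The "proportion of trace" line can be reproduced from the model object itself: it is the squared singular values of the discriminants, normalized to sum to one. A sketch on the built-in iris data (not our crime data):

```r
# Sketch: proportion of trace = squared singular values, normalized to sum to 1.
library(MASS)   # ships with R
fit <- lda(Species ~ ., data = iris)
pt  <- fit$svd^2 / sum(fit$svd^2)
round(pt, 4)    # matches the "Proportion of trace" line printed for fit
```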
We can visualize the results of the LDA with a biplot, where the original variables and their effects are drawn as arrows on top of the plot. The first two LD functions are shown in the plot.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# plot the lda results
classes <- as.numeric(train$crim)
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 2, col = "#666666")
The biplot reveals that the variable rad separates the high crime rate class from the other classes nearly perfectly, as there is only one observation that mixes with the lower classes. The three lower crime rate classes don’t separate properly.
We can see how well the LDA model performs as a whole by using it to predict the classes of the test set. Then we can calculate the total prediction error by crosstabulating the results against the original data, similarly to what was done when predicting with the logistic regression model. (Note to self: there’s actually no need to remove the categorical crime rate variable from the test data, but I did as instructed.)
# save and remove categorical crime data from test data
crime <- test$crim
test <- dplyr::select(test, -crim)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = as.numeric(crime), predicted = lda.pred$class)
## predicted
## correct 1 2 3 4
## 1 12 7 0 0
## 2 5 17 11 0
## 3 0 7 15 0
## 4 0 0 0 28
We can calculate from the confusion matrix that the total error rate is about 29 % in the test data. [The confusion matrix changes on every knitting because of randomization, so the error rate given here probably doesn’t match the matrix shown above!] In other words, the model correctly classifies about 71 % of new cases, which is not a bad rate compared to the expected success rate of 25 % when randomly guessing one of four classes. The best classification performance was achieved in the high crime rate class.
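The arithmetic, using the counts shown in the matrix above (remember that they change on each knit):

```r
# Sketch: total error rate = 1 - (correct predictions on the diagonal / all cases).
conf <- matrix(c(12,  7,  0,  0,
                  5, 17, 11,  0,
                  0,  7, 15,  0,
                  0,  0,  0, 28), nrow = 4, byrow = TRUE)
1 - sum(diag(conf)) / sum(conf)   # about 0.29
```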
We used LDA to separate and predict known classes of individual observations. But what if we don’t know the classes in advance, and instead want to detect similar observations and assign them to groups or clusters based on their similarity? We can use clustering methods, e.g. the K-means algorithm, to achieve this.
First, we need to calculate the (dis)similarity or distance of individual observations. There are many different distance measures, but let’s use the Euclidean distance, which is the “normal” or most common distance measure.
# load MASS and the Boston dataset
library(MASS)
data("Boston")
bs2 <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(bs2)
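As a quick sanity check with made-up numbers, the value dist() returns for two observations equals the hand-computed Euclidean distance sqrt(sum((x - y)^2)):

```r
# Euclidean distance by hand vs. dist(), on two made-up observations
x <- c(1, 2, 3)
y <- c(4, 6, 3)
by_hand <- sqrt(sum((x - y)^2))           # sqrt(9 + 16 + 0) = 5
by_dist <- as.numeric(dist(rbind(x, y)))  # dist() defaults to Euclidean
c(by_hand, by_dist)  # both 5
```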
K-means is an unsupervised clustering method that iteratively updates the cluster memberships of individual observations so that, eventually, similar observations belong to the same clusters. We need to decide the number of clusters before running the K-means algorithm. Let’s start with 5 clusters:
km <- kmeans(dist_eu, centers = 5)
But wait! How do we know the optimal number of clusters? One way to determine it is to look at how the total within-cluster sum of squares (TWCSS) behaves as the number of clusters changes.
# determine the initial number of clusters
k_max <- 10
# calculate the total within sum of squares
set.seed(123)
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
The optimal number of clusters is where the total WCSS drops radically, so let’s rerun K-means with 2 clusters. We can visualize the results with a pairplot, where the clusters are separated by color.
# the optimal number of clusters is where the TWCSS drops drastically
# -> 2 clusters
km <- kmeans(dist_eu, centers = 2)
# plot the scaled Boston dataset with clusters
pairs(bs2, col = km$cluster, cex = 0.1, pch = 16)
We can see from the pairplot that rad and tax are the two most informative variables to separate the two clusters from each other. Also nox seems to contain useful information for clustering.
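This can also be backed up numerically. The sketch below re-creates the objects from the chunks above and compares the cluster means variable by variable (the exact ranking may shift with the random seed):

```r
# which scaled variables differ most between the two clusters?
library(MASS)
bs2 <- as.data.frame(scale(Boston))
set.seed(123)
km <- kmeans(dist(bs2), centers = 2)
# absolute gap between the two cluster means, per variable
mean_gap <- sapply(bs2, function(v) abs(diff(tapply(v, km$cluster, mean))))
sort(mean_gap, decreasing = TRUE)  # rad and tax should rank near the top
```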
K-means clusters as targets of LDA
# k-means again
set.seed(123)
km2 <- kmeans(dist_eu, centers = 5)
# LDA using the k-means clusters as target classes
bs2$cl <- km2$cluster
lda.fit2 <- lda(cl ~ ., data = bs2)
plot(lda.fit2, col=as.numeric(bs2$cl), dimen=2)
lda.arrows(lda.fit2, myscale = 3, col = "#666666")
Color is defined by the categorical crime rate:
model_predictors <- dplyr::select(train, -crim)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crim)
Color is defined by the clusters of the k-means:
train$cl <- bs2$cl[match(rownames(train), rownames(bs2))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cl))
The data used in this exercise is taken from the Human Development Report 2015 (see more info here).
Two different datasets, one on human development and another on gender inequality, were combined in the data wrangling exercise to include the following variables:
* edu2FMrat Ratio of females to males with at least secondary education
* labFMrat Ratio of female to male labour force participation
* lifeexp Life expectancy at birth
* expedu Expected years of education
* GNI Gross national income per capita
* matmor Maternal mortality ratio
* adbi Adolescent birth rate
* repparl Percentage of female representatives in parliament
Observations (countries) with missing values were removed, and the manipulated data consist of 155 countries.
Let’s load the data.
# Load human data
library(dplyr)
library(corrplot)
library(tidyr)
load("./data/human.RData")
What does the data look like? Let’s produce summaries and a graphical overview of the variables:
# Structure and dimensions
str(human4)
## 'data.frame': 155 obs. of 8 variables:
## $ edu2FMrat: num 1.007 0.997 0.983 0.989 0.969 ...
## $ labFMrat : num 0.891 0.819 0.825 0.884 0.829 ...
## $ lifeexp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ expedu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : num 64992 42261 56431 44025 45435 ...
## $ matmor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adbi : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ repparl : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# Variable summaries and graphical overview
summary(human4)
## edu2FMrat labFMrat lifeexp expedu
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI matmor adbi repparl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
library(GGally)
library(ggplot2)
p <- ggpairs(human4, mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
All the variables are numerical but have varying scales, e.g. edu2FMrat and labFMrat are ratios, repparl is a percentage, and GNI is a sum index. There are several strong correlations between the variables, e.g. a positive correlation of 0.76 between matmor (maternal mortality ratio) and adbi (adolescent birth rate), and a negative correlation of -0.86 between matmor and lifeexp (life expectancy at birth).
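The quoted correlations can be read off cor() directly, and corrplot (loaded earlier but not yet used) draws the whole matrix at a glance. This sketch assumes human4 is loaded as above:

```r
# correlation matrix of the human4 variables (assumes human4 is loaded)
cor_mat <- cor(human4)
round(cor_mat["matmor", "lifeexp"], 2)  # the strong negative correlation
corrplot(cor_mat, method = "circle", type = "upper", tl.cex = 0.8)
```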
Principal component analysis (PCA) is a dimension reduction technique that transforms the data into principal components, which are linear combinations of the original variables. Let’s first try PCA on unstandardized data.
# PCA on unstandardized data
pca_human <- prcomp(human4)
pca_human
## Standard deviations:
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation:
## PC1 PC2 PC3 PC4
## edu2FMrat -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
## labFMrat 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
## lifeexp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
## expedu -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
## matmor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
## adbi 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
## repparl -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## edu2FMrat -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
## labFMrat 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
## lifeexp 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
## expedu 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
## GNI -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## matmor 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
## adbi 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
## repparl -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
This doesn’t look very good, as there seems to be only one important variable. This is because PCA assumes that large variance means an important variable. The variable GNI (gross national income per capita) has a much larger variance than the other variables and therefore gets far too much weight in the PCA. We need to standardize the data to remove the effect of the differing variances, and redo the PCA.
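The effect is easy to reproduce with two made-up variables whose variances differ wildly: without scaling, PC1 is essentially just the large-variance variable.

```r
# toy demo: unscaled PCA is dominated by the large-variance variable
set.seed(1)
toy <- data.frame(a = rnorm(100),             # sd about 1
                  b = rnorm(100, sd = 1000))  # sd about 1000
p_raw <- prcomp(toy)
abs(p_raw$rotation["b", "PC1"])  # essentially 1: PC1 is just 'b'
# after scaling, the variance is shared between the components
p_std <- prcomp(toy, scale. = TRUE)
summary(p_std)$importance["Proportion of Variance", ]
```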
# Standardize and repeat
human_std <- scale(human4)
summary(human_std)
## edu2FMrat labFMrat lifeexp expedu
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI matmor adbi repparl
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations:
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation:
## PC1 PC2 PC3 PC4 PC5
## edu2FMrat -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## labFMrat 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## lifeexp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## expedu -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## matmor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## adbi 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## repparl -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## edu2FMrat 0.17713316 0.05773644 0.16459453
## labFMrat -0.03500707 -0.22729927 -0.07304568
## lifeexp -0.42242796 -0.43406432 0.62737008
## expedu -0.38606919 0.77962966 -0.05415984
## GNI 0.11120305 -0.13711838 -0.16961173
## matmor 0.17370039 0.35380306 0.72193946
## adbi -0.76056557 -0.06897064 -0.14335186
## repparl 0.13749772 0.00568387 -0.02306476
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "darkviolet"))
The standardization helped, and the results now look more reasonable. The first principal component explains 53 % of the variation, and the first three principal components cumulatively explain almost 80 % of the variation in the data. So we have effectively reduced the number of variables needed to characterize the dataset. Moreover, the principal components are always orthogonal (i.e. they don’t correlate with each other), which is handy if we want to perform further analyses, e.g. linear regression, on them.
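The orthogonality claim is easy to verify on any PCA result; here with the built-in mtcars data as a stand-in:

```r
# PC scores are uncorrelated: their correlation matrix is the identity
p <- prcomp(mtcars, scale. = TRUE)
max(abs(cor(p$x) - diag(ncol(p$x))))  # numerically zero
# the explained-variance shares come from the squared standard deviations
round(p$sdev^2 / sum(p$sdev^2), 3)
```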
The first principal component (PC1) is a linear combination of all variables except female representation in the labour force and in parliament (labFMrat and repparl, respectively), as these two are the only variables with low loadings on PC1 (i.e. they don’t correlate with PC1). On the other hand, these two variables have high loadings on PC2, whereas the other variables barely correlate with PC2. Female representation both in the labour force and in parliament correlates positively with PC2, so we can interpret PC2 as representing gender equality (the most gender-equal countries are located at the top of the biplot). Female participation in secondary education (edu2FMrat), life expectancy (lifeexp), expected years of education (expedu) and gross national income per capita (GNI) have negative loadings on PC1, whereas maternal mortality ratio (matmor) and adolescent birth rate (adbi) have positive loadings on it. This implies that a high value of PC1 (a location on the right of the biplot) signifies a low level of development in a country, combining social, educational and economic aspects of development.
Next, we try another dimension reduction method on the tea dataset from the FactoMineR package. This dataset includes 300 observations of people’s tea drinking habits.
# Load tea dataset
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
The dataset consists almost exclusively of factor variables, so we cannot perform PCA here, as it requires numerical variables. The data also contains a very large number of variables, so let’s choose only a subset of them for further analysis.
library(tidyr)
tea2 <- tea[c("breakfast","tea.time","friends","Tea","How","sugar","sex","sophisticated")]
str(tea2)
## 'data.frame': 300 obs. of 8 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ sophisticated: Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
gather(tea2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Let’s perform a multiple correspondence analysis (MCA) on these 8 variables. MCA is a generalization of PCA that can be used to represent categorical data in Euclidean space.
# MCA
res.mca <- MCA(tea2, graph=FALSE)
summary(res.mca, nbelements=Inf, nbind=5)
##
## Call:
## MCA(X = tea2, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.194 0.175 0.170 0.136 0.120 0.115
## % of var. 14.076 12.747 12.366 9.927 8.740 8.337
## Cumulative % of var. 14.076 26.823 39.189 49.116 57.855 66.192
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.108 0.104 0.095 0.089 0.069
## % of var. 7.887 7.533 6.905 6.446 5.037
## Cumulative % of var. 74.079 81.612 88.517 94.963 100.000
##
## Individuals (the 5 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.796 1.092 0.393 | 0.611 0.709 0.231 | 0.046
## 2 | 0.259 0.115 0.035 | 1.012 1.947 0.538 | 0.238
## 3 | -0.479 0.395 0.246 | -0.172 0.056 0.032 | -0.074
## 4 | 0.687 0.812 0.465 | -0.406 0.313 0.162 | 0.132
## 5 | 0.563 0.546 0.247 | 0.338 0.218 0.089 | 0.111
## ctr cos2
## 1 0.004 0.001 |
## 2 0.111 0.030 |
## 3 0.011 0.006 |
## 4 0.034 0.017 |
## 5 0.024 0.010 |
##
## Categories
## Dim.1 ctr cos2 v.test Dim.2 ctr
## breakfast | -0.023 0.016 0.000 -0.380 | 0.656 14.735
## Not.breakfast | 0.021 0.015 0.000 0.380 | -0.606 13.601
## Not.tea time | 0.793 17.748 0.488 12.077 | -0.226 1.587
## tea time | -0.615 13.757 0.488 -12.077 | 0.175 1.230
## friends | -0.302 3.855 0.172 -7.175 | -0.181 1.522
## Not.friends | 0.570 7.265 0.172 7.175 | 0.341 2.868
## black | -0.151 0.361 0.007 -1.490 | 1.070 20.158
## Earl Grey | -0.079 0.259 0.011 -1.835 | -0.308 4.341
## green | 0.800 4.546 0.079 4.863 | -0.601 2.838
## alone | -0.057 0.136 0.006 -1.339 | -0.298 4.105
## lemon | -0.104 0.077 0.001 -0.632 | -0.401 1.262
## milk | 0.375 1.906 0.037 3.342 | 0.907 12.326
## other | -1.011 1.981 0.032 -3.075 | 1.567 5.256
## No.sugar | -0.431 6.199 0.199 -7.706 | 0.225 1.866
## sugar | 0.461 6.627 0.199 7.706 | -0.241 1.995
## F | -0.582 12.993 0.495 -12.162 | -0.133 0.749
## M | 0.850 18.957 0.495 12.162 | 0.194 1.093
## Not.sophisticated | 0.360 2.367 0.051 3.910 | 0.548 6.069
## sophisticated | -0.142 0.936 0.051 -3.910 | -0.217 2.399
## cos2 v.test Dim.3 ctr cos2 v.test
## breakfast 0.397 10.899 | -0.236 1.965 0.051 -3.920 |
## Not.breakfast 0.397 -10.899 | 0.218 1.814 0.051 3.920 |
## Not.tea time 0.039 -3.436 | 0.192 1.181 0.029 2.920 |
## tea time 0.039 3.436 | -0.149 0.915 0.029 -2.920 |
## friends 0.062 -4.290 | -0.317 4.828 0.189 -7.526 |
## Not.friends 0.062 4.290 | 0.598 9.099 0.189 7.526 |
## black 0.375 10.592 | 0.491 4.378 0.079 4.862 |
## Earl Grey 0.171 -7.143 | -0.415 8.162 0.311 -9.647 |
## green 0.045 -3.656 | 1.328 14.256 0.218 8.072 |
## alone 0.164 -7.012 | 0.326 5.083 0.198 7.685 |
## lemon 0.020 -2.438 | -1.340 14.510 0.222 -8.143 |
## milk 0.219 8.088 | -0.351 1.899 0.033 -3.127 |
## other 0.076 4.766 | 0.300 0.199 0.003 0.913 |
## No.sugar 0.054 4.023 | 0.541 11.122 0.313 9.674 |
## sugar 0.054 -4.023 | -0.578 11.889 0.313 -9.674 |
## F 0.026 -2.780 | 0.077 0.260 0.009 1.612 |
## M 0.026 2.780 | -0.113 0.379 0.009 -1.612 |
## Not.sophisticated 0.119 5.958 | -0.527 5.779 0.110 -5.727 |
## sophisticated 0.119 -5.958 | 0.208 2.285 0.110 5.727 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.000 0.397 0.051 |
## tea.time | 0.488 0.039 0.029 |
## friends | 0.172 0.062 0.189 |
## Tea | 0.080 0.383 0.364 |
## How | 0.063 0.322 0.295 |
## sugar | 0.199 0.054 0.313 |
## sex | 0.495 0.026 0.009 |
## sophisticated | 0.051 0.119 0.110 |
The first two dimensions retain 27 % of the variance in the data. The v-test values show that most categories have significant (|v.test| > 1.96) coordinates in the first three dimensions. The variables sex and tea.time correlate most strongly with dimension 1, and the variables breakfast, Tea and How with dimension 2.
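The 1.96 threshold comes from the standard normal distribution: a v.test value behaves like a z-score under the null, so |v.test| > 1.96 corresponds to a two-sided 5 % significance level.

```r
# the v.test cutoff is the two-sided 5 % quantile of the standard normal
qnorm(0.975)      # 1.959964
2 * pnorm(-1.96)  # p-value at the cutoff, about 0.05
```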
We can visualize the results with biplots using different options.
# MCA
plot(res.mca, invisible = c("var"))
plot(res.mca, invisible = c("ind"), habillage="quali")
The second biplot shows the individual categories plotted in the (Euclidean) MCA space formed by the first two MCA dimensions. Categories that are mutually similar are positioned close to each other in the biplot. There is one clear outlier category at the top of the plot: those who drink their tea in some “other” way. However, there are very few cases in this category, as could be seen in the barplot plotted earlier.
In the second biplot, dimension 1 separates females and males into different groups, as the category summary above implied. Females seem to prefer drinking their tea at tea time, with friends and without sugar, whereas men are not as particular about drinking tea at tea time or with friends, and use sugar more often. Dimension 2 separates those who drink tea at breakfast from those who do not. The former group prefers black tea with milk, whereas the latter chooses green tea or Earl Grey (weird! isn’t Earl Grey a black tea variant?) taken alone or with lemon.